Adding the option to disable the DNS processor failure or success cache #44932
andrewkroh merged 17 commits into elastic:main
- QF1008: I disagree with removing the additional qualification, as it makes things more readable, but I am removing the qualifier to appease the linter god.
Can you please extend the proposed commit message to explain the why, and what the use case for disabling the cache is?
beats/.github/PULL_REQUEST_TEMPLATE.md
Line 19 in c203b82
Turning off the caching will significantly limit the throughput of the pipeline. Even if each request takes 1ms to complete, that means the maximum throughput is 1000 EPS.
Also, the documentation for the processor will need to be updated to include the new configuration parameter.
Added motivation.
- document Enabled settings - Notes with warnings on throughput and compounding effects
colleenmcginnis left a comment
Some minor suggestions below.
Co-authored-by: Colleen McGinnis <colleen.j.mcginnis@gmail.com>
Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
@Mergifyio backport 9.0 9.1
✅ Backports have been created
This enables use cases that require resolving the current DNS record, regardless of the record's TTL or any previously cached values. It is useful, for example, when monitoring a DNS server or when recorded events must capture the environment's state at a specific moment. When a cache is used, the TTL determines the time frame in which an agent might observe a stale record instead of the current one. This unpredictability can be undesirable when optimizing for rapid time-to-intervention.

Disabling the cache has significant throughput implications. The processing time for a single event will be at least the DNS round-trip time. For example, if a DNS request takes 1 ms, the maximum serial throughput is limited to 1000 events/sec. Known use cases for this feature have low throughput requirements. Throughput can be increased by deploying multiple, parallel agents.

NOTE: Setting the failure cache TTL to a very low value (e.g., 1ns) achieves a similar, but imperfect, effect.

NOTE: While the config allows setting a TTL on the success cache, this option is currently ignored. A future enhancement could honor this setting (e.g., by using min(configured_ttl, record_ttl)), which would align with the behavior of other DNS clients.

(cherry picked from commit eee15e7)

# Conflicts:
#	libbeat/processors/dns/dns_test.go
… (#45078)
(cherry picked from commit eee15e7)
Co-authored-by: Michael Bischoff <mjmbischoff@controplex.com>
Co-authored-by: Visha Angelova <91186315+vishaangelova@users.noreply.github.com>
Proposed commit message
Adds the option to disable the success and failure cache.
Motivation
This enables use cases that require capturing the DNS record as it is at the current point in time, regardless of any cached value or the record's TTL. Examples are monitoring a DNS server, or recording events that need to capture the current state of the environment. The TTL defines the time frame during which an old value might be used instead of the current DNS record; in other words, the window in which the agent might observe either the old or the new record, depending on when the previous request was made. This unpredictability can be undesirable when optimizing time-to-intervention.
Disabling the cache has throughput implications: when events are processed serially, each event takes at least the DNS round-trip time. For example, if the round-trip time of a DNS request is 1 ms, the maximum throughput is limited to 1000 events/sec. Known use cases have low throughput requirements. Parallelization, for example by deploying multiple agents, can be used to stretch this number. At that point, we would urge reevaluating the use case and the use of the cache.
NOTE: Setting the TTL on the failure cache to 1ns achieves a similar, but imperfect, effect.
NOTE: Setting a TTL on the success cache is a valid option as far as the code is concerned; it is, however, ignored, as is also documented in the code, and it is omitted as an option in the documentation. Honoring both the setting and the record's TTL (min(ttl, dns_record_ttl)) would be a different route, similar to the behaviour of other DNS clients.
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Disruptive User Impact
None known. The default values leave the old behavior intact, and the setting that triggers the new behavior is added in this PR.
How to test this PR locally
Define the DNS processor, observe cache stats / resolver requests.
Related issues